Finding instabilities in the community structure of complex networks

The problem of finding clusters in complex networks has been extensively studied by mathematicians,
computer scientists and, more recently, by physicists. Many of the existing algorithms
partition a network into clear clusters, without overlap. Here we introduce a method that identifies
the nodes lying "between clusters" and allows for a general measure of the stability of the clusters.
This is done by adding noise to the weights of the edges of the network. Our method can in principle
be applied with any clustering algorithm, provided that it works on weighted networks. We
present several applications on real-world networks using the Markov Clustering Algorithm (MCL).

The framework of complex networks provides a remarkable tool for the analysis of
complex systems consisting of many interacting entities. Systems such as the
Internet, the interaction map of proteins, social networks, etc. have been
successfully described by considering them as complex networks. Historically,
the first theoretical model to describe interacting complex systems was the
Erdős-Rényi graph. However, this model fails to describe several features
frequently observed in real-world networks. The two most famous ones are the
degree distribution and the clustering coefficient. Recently, different models
have been proposed to give a more realistic understanding of those features.

Another characteristic of the topology of complex net- 
works is their cluster structure. In real-world networks, 
it is common to have small sets of nodes highly connected 
with each other but with only a few connections to the 
rest of the network. Finding the clusters of a network is 
a crucial step toward understanding its internal structure. A large number of
clustering algorithms have been developed, each of them attempting to find a
reasonably good partition of the network. In most cases those algorithms
partition the network into non-overlapping clusters, assigning each node to a
given cluster ("hard clustering"). However, the resulting clustering is
sometimes questionable, especially for nodes that "lie on the border" between
two clusters. We refer to such nodes as unstable nodes. Fig. 1 shows a typical
case where a node (7) lies exactly between two clear clusters.

Defining and identifying unstable nodes is closely re- 
lated to the problem of evaluating the stability of the 
clustering. A first attempt was proposed by Wilkinson, who modified the
Girvan-Newman algorithm. Recently, several non-deterministic clustering
algorithms have been developed. Using the stochasticity
of the output, one can probe the stability of the cluster- 
ing. In this work, we introduce a general method to find 
unstable nodes and evaluate the stability of the clusters. 




FIG. 1: Small toy network with one unstable node (7). The clusters obtained
without noise are labeled with different colors. Only probabilities
$p_{ij} < 0.8$ are shown (dashed edges). r = 1.6 and a = 0.5.



Instead of having a stochastic element in the algorithm, we propose to introduce
stochasticity in the network itself and to use a hard-clustering algorithm (we
chose the Markov Clustering Algorithm, MCL, but the method does not depend
explicitly on this choice). The idea is to add random noise to the weights of
the edges of the network (in this study the noise added to the edge weights,
initially equal to 1, is uniformly distributed in [-a, a], with 0 < a < 1).
Noise in this context is not only a useful tool to reveal cluster instabilities;
it also has a deeper interpretation. In many real-world networks, edges come
with intrinsic weights, but usually no information is given about the
uncertainties on these values. Adding noise compensates for this gap, albeit
arbitrarily, by taking into account the possible effects of such uncertainties.
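
As a minimal sketch of this perturbation step, the snippet below adds independent uniform noise in [-a, a] to unit edge weights. The function name `add_uniform_noise` and the dictionary representation of the weighted network are illustrative assumptions, not part of the method as described in the text.

```python
import random

def add_uniform_noise(weights, a, rng=None):
    """Return a copy of `weights` (dict: edge -> weight) in which every weight
    has received independent noise drawn uniformly from [-a, a]."""
    if not 0 < a < 1:
        raise ValueError("the noise amplitude must satisfy 0 < a < 1")
    rng = rng or random.Random()
    return {edge: w + rng.uniform(-a, a) for edge, w in weights.items()}

# Hypothetical toy network: all edges start with weight 1.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
weights = {e: 1.0 for e in edges}
noisy = add_uniform_noise(weights, a=0.5, rng=random.Random(0))
```

Because a < 1, every noisy weight stays strictly positive, so the perturbed network keeps the same set of edges as the original one.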

Comparing how the clusters change from one noisy realization to another provides
information that could not have been extracted with standard clustering
algorithms. For instance, some nodes will "switch from one cluster to another"
between different runs of the clustering algorithm with noise (node 7 in
Fig. 1). Clusters are determined only by the nodes they are composed of. Hence,
what "switch from one cluster to another" means has to be defined more
precisely. We first introduce a probability $p_{ij}$ that the edge between node
i and node j connects two nodes in the same cluster. After several runs of the
clustering algorithm with noise, one obtains a network where edges with
$p_{ij} = 1$ are always within a cluster and edges with $p_{ij}$ close to 0
connect two different clusters. Edges with a probability lower than a threshold
$\theta$ are considered external edges (typically $\theta = 0.8$). By removing
those edges, one gets a disconnected network. Here, we use the word cluster for
the clusters obtained without noise, and subcomponent for the disconnected parts
of the network after the removal of the external edges. If the community
structure of the network is stable under several repetitions of the clustering
with noise, the subcomponents of the disconnected network will correspond to the
clusters obtained without noise. In the opposite case, a new community structure
will appear, with some similarity to the initial one. In order to identify which
subcomponents correspond to the initial clusters, we introduce the notion of
similarity between two sets of nodes. If $E_1$ (resp. $E_2$) is the set of
clusters (resp. the set of subcomponents), we use the following definition of
the similarity $s_{ij}$ between cluster $C_{1j} \in E_1$ and subcomponent
$C_{2i} \in E_2$:



$$ s_{ij} = \frac{|C_{1j} \cap C_{2i}|}{|C_{1j} \cup C_{2i}|} $$

For every $C_{1j} \in E_1$, we find the subcomponent $C_{2i}$,
$1 \le i \le |E_2|$, with the maximal similarity and identify it with the
cluster $C_{1j}$ (most of the time $C_{2i}$ corresponds to the stable core of
the cluster $C_{1j}$). If more than one subcomponent attains the maximum, none
of them is identified with the cluster. In practice, this latter case is
extremely rare.
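
The first part of this procedure (repeated clustering of noisy realizations, estimation of the edge probabilities, removal of external edges) could be sketched as follows. The clustering routine is passed in as `cluster_fn`, a hypothetical callable that maps a weight dictionary to a node-to-label dictionary; any hard-clustering algorithm for weighted networks could play this role. The function names and default values are assumptions of this sketch.

```python
import random

def edge_probabilities(edges, cluster_fn, a=0.5, runs=100, seed=0):
    """Estimate p_ij as the fraction of noisy runs in which both end nodes
    of edge (i, j) are assigned to the same cluster."""
    rng = random.Random(seed)
    together = {e: 0 for e in edges}
    for _ in range(runs):
        # Unit weights perturbed by uniform noise in [-a, a].
        noisy = {e: 1.0 + rng.uniform(-a, a) for e in edges}
        labels = cluster_fn(noisy)
        for (i, j) in edges:
            if labels[i] == labels[j]:
                together[(i, j)] += 1
    return {e: count / runs for e, count in together.items()}

def subcomponents(nodes, probs, theta=0.8):
    """Connected components left after removing edges with p_ij < theta."""
    adj = {v: set() for v in nodes}
    for (i, j), p in probs.items():
        if p >= theta:
            adj[i].add(j)
            adj[j].add(i)
    seen, comps = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Stand-in clusterer for illustration only: it ignores the weights and
# always returns the same two groups.
def fixed_labels(weights):
    return {1: "A", 2: "A", 3: "B", 4: "B"}

toy_edges = [(1, 2), (2, 3), (3, 4)]
toy_probs = edge_probabilities(toy_edges, fixed_labels, runs=10)
toy_subs = subcomponents([1, 2, 3, 4], toy_probs)
```

With a real clustering algorithm in place of `fixed_labels`, the probabilities take intermediate values for edges whose end nodes are only sometimes clustered together.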

For example, the network in Fig. 1 consists of
three clusters (the three colors) and four subcompo- 
nents ({1,2,3,4,5,6}, {7}, {8,9,10,11,12,13,14,15,16,17}, 
{18,19,20}). Our method identifies the three biggest sub- 
components with the three clusters, while the subcompo- 
nent {7} is not identified with any cluster. 
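
The identification step can be illustrated on this example with a short sketch. Two assumptions are made here: the similarity is taken to be the number of shared nodes divided by the size of the union of the two sets, and node 7 is assumed to belong to the first cluster in the noise-free clustering (the figure itself is not reproduced in the text). The function names are hypothetical.

```python
def similarity(cluster, subcomponent):
    """Overlap of two node sets: |intersection| / |union| (assumed form)."""
    return len(cluster & subcomponent) / len(cluster | subcomponent)

def identify(clusters, subcomps):
    """Map each cluster index to the index of its most similar subcomponent,
    or to None when the maximum is zero or attained more than once."""
    match = {}
    for j, cluster in enumerate(clusters):
        scores = [similarity(cluster, sub) for sub in subcomps]
        best = max(scores)
        winners = [i for i, s in enumerate(scores) if s == best]
        match[j] = winners[0] if best > 0 and len(winners) == 1 else None
    return match

# Subcomponents as listed in the text; cluster membership of node 7 is assumed.
subcomps = [set(range(1, 7)), {7}, set(range(8, 18)), {18, 19, 20}]
clusters = [set(range(1, 8)), set(range(8, 18)), {18, 19, 20}]
match = identify(clusters, subcomps)   # {0: 0, 1: 2, 2: 3}
```

Under these assumptions the first cluster has similarity 6/7 with {1,...,6} and 1/7 with {7}, so the subcomponent {7} is left unidentified.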

Nodes belonging to subcomponents that have never 
been identified with any cluster could be defined as un- 
stable nodes. However in some cases a big cluster splits 
into two subcomponents of comparable size. Assuming 
that almost half of the nodes of the cluster are unstable 
is not realistic and one would rather define a new cluster. 
In practice, subcomponents of four nodes or more often correspond
to a cluster not detected by the algorithm.
We therefore define the unstable nodes as the nodes be- 
longing to subcomponents that have not been identified 
with a cluster and whose size is smaller than 4. 
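
This definition translates directly into a few lines of code. In the sketch below, `identified` is the set of indices of subcomponents that have been matched to a cluster; the function name and argument layout are assumptions made for illustration.

```python
def unstable_nodes(subcomps, identified, max_size=4):
    """Collect the nodes of subcomponents that are not identified with any
    cluster and contain fewer than `max_size` nodes."""
    unstable = set()
    for index, comp in enumerate(subcomps):
        if index not in identified and len(comp) < max_size:
            unstable |= comp
    return unstable

# Subcomponents of the toy network of Fig. 1; the three largest are identified.
fig1_subcomps = [set(range(1, 7)), {7}, set(range(8, 18)), {18, 19, 20}]
fig1_unstable = unstable_nodes(fig1_subcomps, identified={0, 2, 3})   # {7}
```

An unidentified subcomponent of four nodes or more is deliberately skipped, since it is better interpreted as a cluster missed by the algorithm.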

Locally, we can address the question of the stability of the clusters by looking
at the probabilities of the edges inside each cluster and around it. For
instance, if all edges inside the cluster have probability $p_{ij} = 1$ and all
edges connecting the cluster to its neighbors have probability $p_{ij} = 0$, we
can conclude that the cluster is very stable.

From a more global point of view, it is important to understand whether the
partition found by the clustering algorithm actually corresponds to a real
cluster structure. We propose the entropy as a measure of the stability of the
cluster structure. To a first approximation, we assume that the $p_{ij}$ are
independent of each other and we define the average Clustering Entropy (CE) per
edge as:

$$ S = -\frac{1}{m} \sum_{(i,j)} \left\{ p_{ij} \log_2 p_{ij} + (1 - p_{ij}) \log_2 (1 - p_{ij}) \right\}, $$

where the sum is taken over all edges and m is the total number of edges in the
network. If the network is totally unstable (i.e., in the most extreme case,
$p_{ij} = 1/2$ for all edges), S = 1, while if the edges are perfectly stable
under noise ($p_{ij} = 0$ or 1), S = 0.
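
A minimal sketch of this computation is given below. It takes a dictionary of edge probabilities and uses the usual convention that a term with probability 0 contributes nothing to the entropy; the function name is an assumption.

```python
import math

def clustering_entropy(probs):
    """Average binary entropy (in bits) of the edge probabilities p_ij."""
    m = len(probs)
    total = 0.0
    for p in probs.values():
        for q in (p, 1.0 - p):
            if q > 0.0:              # convention: 0 * log2(0) = 0
                total -= q * math.log2(q)
    return total / m

# The two limiting cases mentioned in the text.
s_unstable = clustering_entropy({(1, 2): 0.5, (2, 3): 0.5})   # 1.0
s_stable = clustering_entropy({(1, 2): 1.0, (2, 3): 0.0})     # 0.0
```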

The value of S depends on the noise amplitude a. Nevertheless, it allows a
comparison with a network that has no predefined cluster structure. To avoid
biasing the comparison, we always compare the CE of a network with that of a
randomized version of the network in which the degree of each node is conserved,
using the same a. The randomized network plays the role of a null model, since
the initial clusters (if present) are destroyed by the rewiring process. Note,
however, that we do not assume that the randomized network has no apparent
community structure. If the difference between the CE of the original network
and that of the randomized one is significant (i.e., it is not within the
standard deviation over different randomized versions), it shows that the
network has an internal cluster structure that differs fundamentally, in terms
of stability, from that of a network where the nodes have been connected
randomly.
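
One common way to build such a degree-preserving null model is repeated edge swapping: two edges (a, b) and (c, d) are replaced by (a, d) and (c, b) whenever this creates neither a self-loop nor a duplicate edge. The sketch below assumes this scheme and a simple undirected input graph; the text does not specify which rewiring procedure was actually used, and the function name is hypothetical.

```python
import random

def randomize_preserving_degrees(edges, nswaps, seed=None, max_tries=None):
    """Rewire a simple undirected graph by edge swaps that keep every node's
    degree unchanged. Returns a new edge list."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    max_tries = max_tries or 100 * nswaps
    done = tries = 0
    while done < nswaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c                      # try both orientations
        if len({a, b, c, d}) < 4:
            continue                         # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                         # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edge_list):
    deg = {}
    for u, v in edge_list:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Hypothetical test graph: a ring of 12 nodes with two chords.
ring = [(i, (i + 1) % 12) for i in range(12)] + [(0, 6), (3, 9)]
rewired = randomize_preserving_degrees(ring, nswaps=20, seed=1)
```

The CE of the original network can then be compared with the mean and standard deviation of the CE over many such rewired copies, all clustered with the same a.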

FIG. 2: CE as a function of z, the average number of edges connecting a node
from a given cluster to nodes of other clusters, for a network with 4
communities of 32 nodes. The error bars represent the standard deviation for
different networks with the same $p_{in}$ and $p_{out}$. r = 1.85, a = 0.5.

Before showing applications of our method to study the stability of the
clusters, we briefly describe MCL [12], which we used as a clustering technique.
MCL is based on the idea that when a random walk on a network visits a dense
cluster, it will likely not leave it until many of its vertices have been
visited. However, performing a random walk on a network does not immediately
lead to the clusters, since as time increases the random walk will end up
leaving one cluster for another. MCL favors the most probable random walks
already after a small number of steps, thereby increasing the probability of
staying in the initial cluster. The algorithm works as follows: 1) take the
adjacency matrix A of the network, add the self-edges (1's on the diagonal) and
normalize each column of the matrix to one, in order to obtain a stochastic
matrix W; 2) take the k-th power of the matrix W, $k \in \mathbb{N}$ (we used
k = 2); 3) take the r-th power of every element of $W^k$ (typically r between
1.5 and 2) and normalize each column to one; 4) go back to 2). After several
iterations MCL converges to a matrix idempotent under steps 2) and 3). Only a
few rows of the matrix have nonzero entries, and these give the cluster
structure of the network. Note that the parameter r tunes the granularity of the
clustering: a small r yields a few large clusters, whereas a large r yields
smaller ones.
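
The four steps above can be sketched in plain Python for small networks. This is an illustrative dense-matrix version, not the reference MCL implementation; the function names, the convergence tolerance and the 1e-6 cutoff used to read off the clusters are assumptions of the sketch.

```python
def normalize_columns(M):
    """Rescale every column of a square matrix so that it sums to one."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(nodes, weights, r=1.6, k=2, max_iter=200, tol=1e-9):
    """Cluster an undirected weighted network given as dict (u, v) -> weight."""
    index = {v: t for t, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        A[index[u]][index[v]] = A[index[v]][index[u]] = w
    for t in range(n):
        A[t][t] = 1.0                        # step 1: self-edges
    W = normalize_columns(A)                 # step 1: stochastic matrix
    for _ in range(max_iter):
        E = W
        for _ in range(k - 1):
            E = matmul(E, W)                 # step 2: k-th matrix power
        # step 3: element-wise r-th power, then column normalization
        new = normalize_columns([[x ** r for x in row] for row in E])
        change = max(abs(new[i][j] - W[i][j])
                     for i in range(n) for j in range(n))
        W = new
        if change < tol:                     # step 4: iterate until stable
            break
    clusters = []
    for row in W:                            # nonzero rows give the clusters
        members = {nodes[j] for j, x in enumerate(row) if x > 1e-6}
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Hypothetical example: two triangles joined by a single edge.
tri_nodes = [1, 2, 3, 4, 5, 6]
tri_edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
tri_clusters = mcl(tri_nodes, {e: 1.0 for e in tri_edges}, r=2.0)
```

In this example the walk concentrates on each triangle separately, so the two triangles are recovered as the two clusters. For large networks a sparse-matrix implementation would be needed, since the dense products used here scale poorly with the number of nodes.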

To illustrate the principle of the comparison based on the CE, we apply it to a
well-known benchmark network. The network consists of 4 communities of 32 nodes.
The nodes are connected with a probability $p_{in}$ if they belong to the same
community and $p_{out}$ if not. Typically one varies $p_{in}$ and $p_{out}$
while keeping the average degree of the nodes constant. In Fig. 2 we plot the CE
of the network as a function of z, the average number of edges connecting a node
from a given cluster to nodes of other clusters ($z = 96\, p_{out}$). The
average total degree is fixed at 16. When z is small, the clusters are very well
defined and most algorithms correctly identify them. As z increases, the
clusters become more and more fuzzy, and for z > 7 even the best currently
available algorithms fail to recover the exact cluster structure of the network
(actually, the cluster structure tends to disappear from the network). This
corresponds to the point beyond which the comparison of the CE no longer
distinguishes between the network and a randomized network. We stress that the
clustering entropy makes no reference to the assumed partition of the network
into four clusters, which, given the statistical nature of the links, cannot be
guaranteed for every realization. It is thus an objective measure of the
stability of the network under clustering.
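
For reference, a benchmark network of this kind could be generated as sketched below. Since each node has 96 possible neighbors outside its community and 31 inside, fixing the average degree at 16 gives $p_{out} = z/96$ and $p_{in} = (16 - z)/31$; the second relation is derived here from the numbers quoted in the text rather than stated in it. The function name and return format are assumptions.

```python
import random

def benchmark_network(z, avg_degree=16, groups=4, size=32, seed=None):
    """Random network with `groups` planted communities of `size` nodes.
    z is the expected number of inter-community edges per node."""
    rng = random.Random(seed)
    n = groups * size
    p_out = z / (n - size)                    # 96 * p_out = z
    p_in = (avg_degree - z) / (size - 1)      # 31 * p_in = 16 - z
    community = {v: v // size for v in range(n)}
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community

bench_edges, bench_comm = benchmark_network(z=4, seed=1)
mean_degree = 2 * len(bench_edges) / len(bench_comm)
```

Each call with a different seed gives a different realization, which is why the CE in Fig. 2 is reported with error bars over several networks sharing the same connection probabilities.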

Let us now turn to real-world networks. As a first example, we consider the
"karate club network" built by Zachary [21]. MCL correctly identifies the two
communities, which correspond to the actual division of the club. The only
unstable node is represented with a diamond (Fig. 3). This node is connected to
four nodes of one community and five of the other. From a topological point of
view, it is fully justified to consider it an unstable node. The CE of the
network is 0.14. The randomized network has an average CE of 0.27 ± 0.1 (average
and standard deviation over 100 randomized versions). Thus, on average, the CE
is significantly larger for the randomized network.

FIG. 3: Zachary's karate club network. The two clusters are represented with two
different colors. The unstable node is represented by a diamond. r = 1.8,
a = 0.5.

We studied a linguistic network based on the relation of synonymy in French
[22]. The nodes are words in a given sense, and two nodes are connected if they
are considered synonyms. We applied MCL to the larger components of the network
(up to 10000 nodes) and found a much better lexical representation of the
synonyms. The natural interpretation of unstable nodes in a synonymy network is
that they correspond to ambiguous words. As a validation of our results, we can
measure the clustering coefficient of the unstable nodes. Averaging over the
whole network, we find a clustering coefficient of 0.26 for the unstable nodes
and 0.45 for the stable nodes. Furthermore, the betweenness [23] of unstable
nodes is on average 1.6 times larger. This marked difference was expected, since
unstable nodes often lie between clusters and therefore usually do not have a
large clustering coefficient but do have a larger betweenness. Moreover, the
plot of the edge betweenness versus the probability $p_{ij}$ shows that external
edges have on average a larger betweenness (Fig. 4), which is consistent with
the Girvan-Newman clustering algorithm.

FIG. 4: Edge betweenness versus $p_{ij}$ for a component of 9997 nodes from the
synonymy network. r = 1.6, a = 0.5.

FIG. 5: CE as a function of the parameter r for a network of 185 nodes. The
dashed curve is the average over 50 randomized versions and the error bars
correspond to the standard deviation. a = 0.5.

Fig. 5 shows how the CE varies with the parameter r of MCL for a component of
185 nodes, compared with a randomized version of the same component. For
1.3 < r < 2, the difference in behavior is striking. This shows that the
clusters are not a by-product of the clustering algorithm, but correspond to a
real community structure of the network.
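
The clustering-coefficient validation used for the synonymy network relies on the standard local clustering coefficient, i.e. the fraction of pairs of neighbors of a node that are themselves connected. A minimal sketch, with an adjacency dictionary as an assumed input format and hypothetical function names, is:

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are linked to each other."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return 2.0 * links / (k * (k - 1))

def mean_clustering(adj, nodes):
    """Average local clustering coefficient over a group of nodes."""
    nodes = list(nodes)
    return sum(clustering_coefficient(adj, v) for v in nodes) / len(nodes)

# Hypothetical toy graph: two triangles bridged by node 4.
toy_adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5},
    5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6},
}
c_bridge = clustering_coefficient(toy_adj, 4)        # 0.0
c_inside = mean_clustering(toy_adj, [1, 2, 6, 7])    # 1.0
```

In this toy graph the bridging node has a vanishing clustering coefficient while the nodes inside the triangles have the maximal value, which is the qualitative pattern reported for unstable and stable nodes.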

Finally, we applied MCL to the protein folding network of the anti-parallel
β-sheet peptide developed by Rao and Caflisch [24]. The network consists of
almost 80000 nodes. MCL correctly identifies the native state (or at least part
of it) and other stable configurations such as the curl-like trap. To study the
stability of the clusters, we restricted ourselves to the network with cutoff
(1069 nodes). We can again compare the CE. For r = 1.6 and a = 0.5 we have a CE
of 0.12, while the randomized network shows an entropy of 0.3 ± 0.05 (average
over 50 randomized versions).

Note that the parameter a can in principle influence the results. With a close
to 0, we cannot detect the unstable nodes, while with a larger than 1, the
topology of the network changes dramatically. However, the results do not change
significantly for a broad range of values of a around 0.5. For instance, in the
network displayed in Fig. 1, node 7 was identified as the only unstable node for
0.15 < a < 0.8. Moreover, very similar results are obtained using a Gaussian
distribution for the noise.

In conclusion, the introduction of noise on the edges and the probabilities
$p_{ij}$ provides a well-defined and objective way to identify unstable nodes
and to deal with ambiguities in clustering. The method performs well on the
small test networks presented above. As a validation of our results for larger
networks that can hardly be visualized, we have seen that the clustering
coefficient of the unstable nodes is usually much lower than the average
clustering coefficient of the whole network. Moreover, these nodes have, on
average, a larger betweenness, which is also expected for nodes lying between
clusters. Nevertheless, we could not have identified the unstable nodes only by
comparing the clustering coefficient and the betweenness, since very stable
nodes may still have a large betweenness and a small clustering coefficient, and
vice versa. The Clustering Entropy allows for a quantitative comparison between
a network and a null model. We have found that in many examples the difference
was clear, indicating that the clusters detected by MCL are neither the result
of random fluctuations in the modularity of the network [20], nor an artefact of
the clustering algorithm. Finally, since the method does not depend on a
particular clustering algorithm, it can in principle be implemented using any
clustering technique other than the MCL used here.